model underperform
Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms
Johnson, Nari, Cabrera, Ángel Alexander, Plumb, Gregory, Talwalkar, Ameet
Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets (i.e. "slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data. Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation focused on whether these tools help humans form correct hypotheses about where (for which groups) their model underperforms. We conduct a controlled user study (N = 15) where we show 40 slices output by two state-of-the-art slice discovery algorithms to users, and ask them to form hypotheses about where an object detection model underperforms. Our results provide positive evidence that these tools provide some benefit over a naive baseline, and also shed light on challenges faced by users during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when designing and evaluating new tools for slice discovery.
Discovering the systematic errors made by machine learning models
In this blog post, we introduce Domino, a new approach for discovering systematic errors made by machine learning models. We also discuss a framework for quantitatively evaluating methods like Domino. Machine learning models that achieve high overall accuracy often make systematic errors on coherent slices of validation data. A slice is a set of data samples that share a common characteristic. As an example, in large image datasets, photos of vintage cars comprise a slice (i.e.
Latest Research From Stanford Introduces 'Domino': A Python Tool for Identifying and Describing Underperforming Slices in Machine Learning Models
Machine learning and Artificial Intelligence models have gained promising results in recent years. The major factor behind their success is the availability and development of vast datasets. However, regardless of how many terabytes of data you have or how skilled you are at data science, machine learning models will be useless and even dangerous if you can't make sense of data records. A slice is a collection of data samples with a common feature. For example, in a picture dataset, photographs of antique vehicles make up a slice.
Before you launch your machine learning model, start with an MVP
I've seen a lot of failed machine learning models in the course of my work. I've worked with a number of organizations to build both models and the teams and culture to support them. And in my experience, the number one reason models fail is because the team failed to create a minimum viable product (MVP). In fact, skipping the MVP phase of product development is how one legacy corporation ended up dissolving its entire analytics team. The nascent team followed the lead of its manager and chose to use a NoSQL database, despite the fact no one on the team had NoSQL expertise.